-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
levelzero: only use Sysman queries instead of similar Core API queries #595
Conversation
f04d462
to
ecf31fb
Compare
d978732
to
980e829
Compare
This should be rebased/simplified on top of #695 |
@saik-intel Now that we use zesInit(), I guess there's no reason to query both zeDevicePciGetPropertiesExt() and zesDevicePciGetProperties() in case one fails but no the other. The later should always be supported now, right? |
980e829
to
c2b2cfd
Compare
@TApplencourt Could you please run |
Of course! Here it's: At first glance, Everything looks good to me.
But Make check reported an error:
Composite or FLat doesn't change anything:
Using:
|
Thanks. There's at least one bug in the test file, I'll try to debug more your report. Aside of that it looks like ZES fails to report a valid PCI link speed (it shows 0.25GB/s instead of 63 in your case, and nothing on my machines). I'll revert that part and use ZE for this for now. |
Could you comment out the assert on line 188 of tests/hwloc/levelzero.c, make -C tests/hwloc levelzero && tests/hwloc/levelzero? I'd like to confirm that devices are reported by ZES and ZE in different orders. That'd be funny, but easy to fix. |
With that:
Commenting this line makes the tests pass:
Regarding the order, on Aurora we set In quote the doc of But that didn't make the "original" tests pass (aka without commenting the assert the test still fail even when And yes, I see nothing in the spec saying that ZE and ZES devices will be returned in the same order ( and ZE device can be masked via ZE_AFFINITY_MASK so 🤷🏽 ). I did some quick experiment, it's indeed not the case. On the second, I enumerate zes and print the serial number. The order change indeed.
Sorry for the poor code quality bellow if you want to try to reproduce (but I guess you don't have access to a PVC :(:
|
Didn't seem to break anything so far. Signed-off-by: Brice Goglin <[email protected]>
ZE and ZES may return devices in different orders. open-mpi#595 (comment) Signed-off-by: Brice Goglin <[email protected]>
Refs open-mpi#698 Signed-off-by: Brice Goglin <[email protected]>
Now that zesInit() is mandatory, don't bother falling back to the core API, Sysman shouldn't fail. Signed-off-by: Brice Goglin <[email protected]>
We don't need it anymore. Signed-off-by: Brice Goglin <[email protected]>
Signed-off-by: Brice Goglin <[email protected]>
c2b2cfd
to
f0a69cd
Compare
Thanks a lot @TApplencourt! |
ZE and ZES may return devices in different orders. #595 (comment) Signed-off-by: Brice Goglin <[email protected]> (cherry picked from commit efcd681)
This goes on top of #594 which might be merged once a compute-runtime release brings a working zesInit(). This PR requires zesInit() to be more widely available, especially the last commit which completely remove the old setting of ZES_ENABLE_SYSMAN=1 in the env.Now that we require zesInit(), just use ZES queries instead of switching from/to the Core API depending on ZES being available or not.